Record: 11L Depth Recurrence + BigramHash + EMA 0.9965 — val_bpb 1.0980 (3-seed mean)#1435
AbhayAnandUCSD wants to merge 10 commits into openai:main
Conversation
Adopt PR openai#1421's proven depth recurrence script (1.0925 BPB) as the base, with an optional BigramHash enhancement. Target: ~1.09 BPB, to beat the merged SOTA (1.1147).
3-seed mean 1.0980 BPB (std 0.0008), beating merged SOTA (1.1147) by 0.0167. Depth recurrence on layers 4,5 (13 virtual from 11 physical), BigramHash(1536, 112), EMA 0.9965, GPTQ int6 + Brotli. ~14.6 MB artifact.
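The "13 virtual from 11 physical" arithmetic can be sketched as a layer-execution schedule: each layer listed in RECUR_LAYERS is run a second time with tied weights, so 11 physical blocks yield 13 forward passes. This is an illustrative sketch, not the PR's actual training code; `forward_schedule` is a hypothetical helper.

```python
# Hypothetical sketch of depth recurrence: physical layers 4 and 5 are
# re-applied once with the same (weight-tied) parameters, so 11 physical
# blocks produce 13 virtual layer passes.
RECUR_LAYERS = {4, 5}

def forward_schedule(n_physical: int, recur_layers: set) -> list:
    """Return the sequence of physical layer indices actually executed."""
    schedule = []
    for i in range(n_physical):
        schedule.append(i)
        if i in recur_layers:
            schedule.append(i)  # second pass through the same block
    return schedule

sched = forward_schedule(11, RECUR_LAYERS)
assert len(sched) == 13  # 13 virtual layers from 11 physical
```

With RECUR_LAYERS=3,4,5 the same schedule gives 14 virtual layers, matching the alternative configuration mentioned later in the thread.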
11-task plan for re-running exp4 BigramHash + depth recurrence with SP4096 tokenizer. Includes retokenization on-pod, 3-seed training, and separate PR creation.
…text
- Logged 4 experiments: smoke test, JEPA 1xH100, baseline 1xH100, JEPA 8xH100 (interrupted)
- Updated open PRs: SP8192 stack now at 1.078 BPB (PR openai#1437)
- Revised depth recurrence from dead-end to viable (PR openai#1394, openai#1435)
- Updated strategy: Phase 1 = JEPA on PR openai#1019, Phase 2 = rebase on SP8192
- Updated blockers: grant submitted, all pods terminated

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…remove non-submission files
- Add Reproduction section with torchrun command to README
- Add GPTQ calibration note (AR self-generated, not validation data)
- Fix submission.json: precise val_bpb/val_loss, correct track format
- Remove step_stop (ambiguous across seeds)
- Remove docs/superpowers/ and experiments/ (not part of submission)
EMA_DECAY envvar (default=0.997, sota_32 uses 0.9965):
- PR openai#1435 shows EMA=0.9965 beats 0.997 by +0.017 BPB (1.0980 vs 1.1147)
- args.ema_decay_param wired to replace hardcoded 0.997

RECUR_LAYERS=4,5 at step 3000 (PR openai#1435):
- 13 virtual layers from 11 physical (vs 3,4,5 = 14 virtual)
- PR openai#1435 config: activate at step 3000

SLOT code present but DISABLED (SLOT_ENABLED=0 by default):
- eval_val_slot(), forward_hidden(), compute_logits() added to train_gpt_sota_28.py
- SLOT is retroactive 2-pass: optimizes delta on same tokens it scores = not causal
- All SLOT PRs (openai#1313, openai#1488) remain unmerged

Expected: ~1.095-1.10 BPB (WD=0.04 + EMA=0.9965 + RECUR PR#1435 config)
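A minimal sketch of the EMA-of-weights update the envvar controls, assuming the standard formulation `ema = d*ema + (1-d)*param` and the default of 0.997 described above. `ema_update` is a hypothetical helper, not the repo's actual function.

```python
# Minimal EMA-of-weights sketch. Assumed semantics: p_ema <- d*p_ema + (1-d)*p.
# The decay is read from the EMA_DECAY env var, defaulting to 0.997 as in the
# commit message; the record run sets it to 0.9965.
import os

def ema_update(ema_params: dict, params: dict, decay: float) -> None:
    for k in params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * params[k]

decay = float(os.environ.get("EMA_DECAY", "0.997"))
ema = {"w": 0.0}
for step in range(3):
    ema_update(ema, {"w": 1.0}, decay)  # converges toward 1.0 as 1 - d**n
```

A higher decay (0.9965 vs 0.997 would be lower, note: 0.9965 < 0.997) means the average tracks recent weights slightly faster; the thread attributes part of the BPB gain to this choice.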
Community Review — Record: 11L Depth Recurrence + BigramHash + EMA 0.9965 — val_bpb 1.0980 (3-seed mean)

BPB: 1.0980 | Compliance: LOOKS CLEAN (score-first-per-chunk TTT, the legal #1416/#1423 pattern)

What I found in the code (head SHA …): the TTT path at line 1571 implements the score-first-per-chunk pattern. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that is what the code does here, chunk by chunk.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 3.18s, dim=512, layers=11, vocab=1024, code=86033 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier, and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the deterministic AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path (e.g. multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function), please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora.
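The "score-first-per-chunk" ordering the review checks for can be sketched as follows. `score` and `update` are hypothetical stand-ins for the model's real evaluation and adapter-update calls; the point is only the ordering: every chunk is scored with the pre-update weights before the adapter adapts on it, so no token's score depends on its own gradient step.

```python
# Sketch of score-first-per-chunk test-time training (TTT) evaluation.
# Legal ordering per the review: score chunk i under current weights,
# THEN update the adapter on chunk i, then move to chunk i+1.
def ttt_eval(chunks, score, update):
    total = 0.0
    events = []
    for i, chunk in enumerate(chunks):
        total += score(chunk)   # score first: pre-update weights
        update(chunk)           # then adapt on the chunk just scored
        events.append(("score", i))
        events.append(("update", i))
    return total, events
```

The illegal "retroactive 2-pass" variant (like the disabled SLOT path) would instead update on a chunk and then re-score the same tokens, breaking causality.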
Summary
Results (3 seeds, 8xH100 SXM)
Current merged SOTA: 1.1147 (PR #1019). Delta: −0.0167 BPB.
BigramHash vs Vanilla Comparison
BigramHash adds ~0.001 BPB improvement at ~270KB artifact cost.
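A hashed bigram embedding of the kind BigramHash(1536, 112) suggests can be sketched as below: the (previous token, current token) pair is hashed into a small bucket table (1536 buckets of 112 dims here), and the looked-up vector would be added to the ordinary token embedding. The hash function and constants are illustrative assumptions, not the PR's exact implementation.

```python
# Hypothetical sketch of a hashed bigram feature table. The real PR's
# hashing scheme may differ; only the bucket-count/dim shape (1536, 112)
# comes from the thread above.
N_BUCKETS, DIM = 1536, 112

def bigram_bucket(prev_tok: int, cur_tok: int, n_buckets: int = N_BUCKETS) -> int:
    # Simple multiplicative mix-and-xor; deterministic and order-sensitive.
    return ((prev_tok * 1000003) ^ cur_tok) % n_buckets

# Each bucket indexes a learned 112-dim vector in a (1536, 112) table,
# i.e. roughly 1536 * 112 parameters, consistent with a ~270KB artifact
# cost at low-bit quantization.
buckets = [bigram_bucket(p, c) for p, c in [(0, 1), (5, 7), (1023, 1023)]]
assert all(0 <= b < N_BUCKETS for b in buckets)
```

Because the table is tiny relative to the model, the ~0.001 BPB gain for ~270KB is a plausible trade under the 16MB artifact cap.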
Attribution